
Conversation


@filip-michalsky commented Aug 12, 2025

Why

Add WebVoyager and GAIA evaluation suites to benchmark Stagehand's web navigation and reasoning capabilities against industry-standard datasets.

What Changed

  • Added WebVoyager eval suite with 643 test cases for web navigation tasks
  • Added GAIA eval suite with 90 test cases for general AI assistant tasks
  • Refactored eval infrastructure to support sampling and filtering
  • Created reusable utilities for JSONL parsing and test case generation (a minimal sketch follows this list)
  • Added configuration for new eval suites in evals.config.json
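
A minimal sketch of what the JSONL parsing and test-case generation utilities might look like. The real helpers live in evals/utils.ts and the suite builders; the function names and the row shape below are illustrative assumptions, not the exact schema:

  import * as fs from "fs";

  // Illustrative shape of a WebVoyager dataset row; the field names here are
  // an assumption, not the exact schema shipped in the dataset file.
  interface WebVoyagerRow {
    id: string;
    web: string; // start URL
    ques: string; // task instruction
  }

  // Parse a JSONL file into typed rows, skipping blank lines.
  function readJsonl<T>(path: string): T[] {
    return fs
      .readFileSync(path, "utf-8")
      .split("\n")
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as T);
  }

  // Turn dataset rows into test cases the eval runner can execute.
  function buildWebVoyagerCases(path: string) {
    return readJsonl<WebVoyagerRow>(path).map((row) => ({
      name: `webvoyager/${row.id}`,
      params: { startUrl: row.web, task: row.ques },
    }));
  }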

Environment Variables

  • EVAL_WEBVOYAGER_SAMPLE: Random sample size from WebVoyager dataset
  • EVAL_WEBVOYAGER_LIMIT: Max cases to run (default: 25)
  • EVAL_GAIA_SAMPLE: Random sample size from GAIA dataset
  • EVAL_GAIA_LIMIT: Max cases to run (default: 25)
  • EVAL_GAIA_LEVEL: Filter GAIA by difficulty level (1, 2, or 3)

Sampling Strategy

The sampling implementation uses a Fisher-Yates shuffle for unbiased random selection when SAMPLE is specified; otherwise it takes the first LIMIT cases. This allows for both deterministic (first N) and randomized (sample N) test runs.
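
A sketch of how this selection might be wired to the environment variables above, with assumed helper names; the actual implementation lives in evals/utils.ts and the suite builders:

  // Fisher-Yates shuffle over a copy, so the source array is not mutated.
  function shuffle<T>(items: T[]): T[] {
    const out = [...items];
    for (let i = out.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [out[i], out[j]] = [out[j], out[i]];
    }
    return out;
  }

  // Random sample of `sample` cases when set, otherwise the first `limit` cases.
  function selectCases<T>(cases: T[], sample?: number, limit = 25): T[] {
    if (sample && sample > 0) {
      return shuffle(cases).slice(0, sample);
    }
    return cases.slice(0, limit);
  }

  // Hypothetical wiring for GAIA; the env-var names match the list above, but
  // this harness is illustrative rather than the real runner code.
  const allGaiaCases: { id: string }[] = []; // loaded from the JSONL dataset in practice
  const sample = Number(process.env.EVAL_GAIA_SAMPLE) || undefined;
  const limit = Number(process.env.EVAL_GAIA_LIMIT) || 25;
  const selected = selectCases(allGaiaCases, sample, limit);
  console.log(`running ${selected.length} GAIA cases`);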

Test Plan

# Test WebVoyager with OpenAI
EVAL_SUITE=webvoyager EVAL_WEBVOYAGER_SAMPLE=1 EVAL_MODEL=openai/gpt-4o-computer-use-preview pnpm run evals

# Test WebVoyager with Claude
EVAL_SUITE=webvoyager EVAL_WEBVOYAGER_SAMPLE=1 EVAL_MODEL=anthropic/claude-3-5-sonnet-20241022 pnpm run evals

# Test GAIA with OpenAI
EVAL_SUITE=gaia EVAL_GAIA_SAMPLE=1 EVAL_MODEL=openai/gpt-4o-computer-use-preview pnpm run evals

# Test GAIA with Claude
EVAL_SUITE=gaia EVAL_GAIA_SAMPLE=1 EVAL_MODEL=anthropic/claude-3-5-sonnet-20241022 pnpm run evals

# Verify existing evals still work
pnpm run evals


changeset-bot bot commented Aug 12, 2025

🦋 Changeset detected

Latest commit: 51246f6

The changes in this PR will be included in the next version bump.


@filip-michalsky marked this pull request as ready for review August 15, 2025 00:52

@greptile-apps bot left a comment


Greptile Summary

This PR adds two industry-standard evaluation suites to benchmark Stagehand's web automation capabilities: WebVoyager (643 test cases) and GAIA (90 test cases). The changes significantly expand the evaluation infrastructure to support data-driven benchmarking against established datasets.

The core architectural change introduces a suite-based evaluation system alongside the existing task-based approach. New suite builders (evals/suites/webvoyager.ts and evals/suites/gaia.ts) read JSONL dataset files and dynamically generate test cases, while corresponding task implementations (evals/tasks/agent/webvoyager.ts and evals/tasks/agent/webarena_gaia.ts) execute the actual evaluations. The system supports flexible sampling strategies using Fisher-Yates shuffle for randomized selection or deterministic first-N selection.

Key infrastructure improvements include:

  • A new core/summary.ts module that extracts summary generation logic into a reusable component
  • Enhanced type system with optional taskParams and params fields to pass dataset-specific parameters to evaluation functions
  • New utility functions for JSONL parsing, data validation, and sampling in evals/utils.ts
  • Environment variable configuration for controlling test execution (sample sizes, limits, difficulty levels)
  • Updated evaluation runner logic in index.eval.ts to handle both static tasks and dynamic dataset-driven evaluations

The datasets themselves are substantial additions: WebVoyager contains 643 web navigation tasks across 13+ websites (Amazon, Google services, GitHub, etc.), while GAIA provides 90 general AI assistant tasks with varying difficulty levels. Both datasets start from standardized URLs and expect structured response formats.

This integration maintains full backward compatibility with existing evaluations while providing the foundation for systematic benchmarking against industry standards. The sampling capabilities allow for both development testing (small samples) and comprehensive evaluation runs.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it maintains backward compatibility and adds well-structured evaluation capabilities
  • Score reflects solid implementation patterns and comprehensive infrastructure changes, though there's a potential division-by-zero edge case in summary generation
  • Pay close attention to evals/core/summary.ts for the division-by-zero issue in the category success rate calculation (a guard sketch follows below)
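
For reference, a hedged sketch of the kind of guard that addresses that edge case; the function name and result shape are illustrative, not the actual evals/core/summary.ts API:

  interface EvalResult {
    category: string;
    success: boolean;
  }

  // Success rate for one category, guarding the empty case so a category with
  // no results reports 0 instead of NaN from a division by zero.
  function categorySuccessRate(results: EvalResult[], category: string): number {
    const inCategory = results.filter((r) => r.category === category);
    if (inCategory.length === 0) return 0;
    const passed = inCategory.filter((r) => r.success).length;
    return passed / inCategory.length;
  }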

12 files reviewed, 3 comments


filip-michalsky and others added 3 commits August 20, 2025 20:08
Resolved conflicts by merging agent task configurations and including both taskParams and agent properties in StagehandInitResult interface.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Resolved conflicts by merging agent task configurations and including both taskParams and agent properties in StagehandInitResult interface.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Add agent evaluation support to CI pipeline
@filip-michalsky
Collaborator Author

Added a CI category for external agent benchmarks:


  # Run all external benchmarks (both GAIA and WebVoyager)
  pnpm evals category external_agent_benchmarks env=LOCAL trials=1

  # Run only GAIA with max 10 test cases
  pnpm evals category external_agent_benchmarks --dataset=gaia max_k=10 env=LOCAL trials=1

  # Run only WebVoyager with max 5 test cases, 2 trials each
  pnpm evals category external_agent_benchmarks --dataset=webvoyager max_k=5 env=LOCAL trials=2

  # Backward compatible - run specific benchmark by name
  pnpm run evals name=agent/gaia api=false trials=1 max_k=10

@@ -39,6 +41,13 @@ for (const arg of rawArgs) {
    }
  } else if (arg.startsWith("provider=")) {
    parsedArgs.provider = arg.split("=")[1]?.toLowerCase();
  } else if (arg.startsWith("--dataset=")) {
    parsedArgs.dataset = arg.split("=")[1]?.toLowerCase();
  } else if (arg.startsWith("max_k=")) {
Collaborator

Note for later: we should make this arg a bit more intuitive (something along the lines of "max number of evals").
